
Royal Navy returns to wind power with trial of robotic sailboats

New Scientist

Oshen's robotic sailboats are powered by the wind and the sun. The UK's Royal Navy may return to the age of sail with a new demonstration involving a flotilla of small, wind-propelled robot boats. Made by Oshen in Plymouth, UK, the vessels, known as C-Stars, are just 1.2 metres long and weigh around 40 kilograms. Solar panels power navigation, communications and sensors, while a sail provides propulsion. Deployed as a constellation, the small vessels act as a wide-area sensor network. "The simplest way of describing C-Stars is as self-deploying, station-keeping ocean buoys," says Oshen CEO Anahita Laverack.


MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining

Portes, Jacob

Neural Information Processing Systems

Although BERT-style encoder models are heavily used in NLP research, many researchers do not pretrain their own BERTs from scratch due to the high cost of training. In the past half-decade since BERT first rose to prominence, many advances have been made with other transformer architectures and training configurations that have yet to be systematically incorporated into BERT.


A review of NMF, PLSA, LBA, EMA, and LCA with a focus on the identifiability issue

Qi, Qianqian, van der Heijden, Peter G. M.

arXiv.org Machine Learning

Across fields such as machine learning, social science, and geography, considerable attention has been given to models that factorize a nonnegative matrix into the product of two or three matrices, subject to nonnegativity or row-sum-to-1 constraints. Although these models are to a large extent similar or even equivalent, they are presented under different names, and their similarity is not well known. This paper highlights similarities among five popular models: latent budget analysis (LBA), latent class analysis (LCA), end-member analysis (EMA), probabilistic latent semantic analysis (PLSA), and nonnegative matrix factorization (NMF). We focus on identifiability, an essential issue for these models, and prove that the solution of LBA, EMA, LCA, and PLSA is unique if and only if the solution of NMF is unique. We also provide a brief review of algorithms for these models. We illustrate the models with a time-budget dataset from social science and end the paper with a discussion of closely related models such as archetypal analysis.
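The factorization at the heart of all five models, and the non-uniqueness that motivates the identifiability question, can be sketched in a few lines of numpy. This is a minimal illustration using the classic Lee-Seung multiplicative updates for NMF (not the paper's own algorithms); the rescaling at the end shows one simple way a factorization can fail to be unique.

```python
import numpy as np

def nmf(V, r, iters=500, seed=0):
    """Basic NMF via Lee-Seung multiplicative updates: V ~ W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + 0.1
    H = rng.random((r, n)) + 0.1
    eps = 1e-12  # guard against division by zero
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# A rank-2 nonnegative matrix, recovered by a rank-2 factorization
rng = np.random.default_rng(1)
V = rng.random((6, 2)) @ rng.random((2, 8))
W, H = nmf(V, 2)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)

# Non-uniqueness: rescaling columns of W and rows of H by any positive
# diagonal matrix D leaves the product W @ H unchanged
D = np.diag([2.0, 0.5])
same_product = np.allclose(W @ H, (W @ D) @ (np.linalg.inv(D) @ H))
```

The diagonal-rescaling ambiguity is the mildest kind; the paper's identifiability results concern when the factors are unique up to exactly such trivial transformations.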


Fast Factorized Learning: Powered by In-Memory Database Systems

Stöckl, Bernhard, Schüle, Maximilian E.

arXiv.org Artificial Intelligence

Learning models over factorized joins avoids redundant computation by identifying and pre-computing shared cofactors. Previous work has investigated the performance gain when computing cofactors on traditional disk-based database systems, but in the absence of published code those experiments could not be reproduced on in-memory database systems. This work describes an implementation of in-database factorized learning using cofactors. We benchmark our open-source implementation for learning linear regression on factorized joins with PostgreSQL, a disk-based database system, and HyPer, an in-memory engine. The evaluation shows a performance gain for factorized learning on in-memory database systems of 70% over non-factorized learning and of a factor of 100 over disk-based database systems. Thus, modern database engines can contribute to the machine learning pipeline by pre-computing aggregates prior to data extraction to accelerate training.
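The core idea, that linear regression needs only aggregate "cofactors" rather than the raw joined rows, can be sketched outside the database. This toy sketch (illustrative only; the paper computes these aggregates inside PostgreSQL and HyPer, pushing them past the join instead of materialising the joined table) shows that the normal equations depend on the data only through X^T X and X^T y.

```python
import numpy as np

# Synthetic regression data standing in for a (joined) training table
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.01 * rng.standard_normal(200)

# Pre-compute the cofactor aggregates once; these are small (d x d and d x 1)
# regardless of how many rows the join would produce
C = X.T @ X   # cofactor matrix
c = X.T @ y

# Solve the normal equations C @ beta = c
beta = np.linalg.solve(C, c)

# Same coefficients as an ordinary least-squares solve over the full matrix
beta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Because the cofactors are sums, they distribute over a join: per-table partial aggregates can be combined without ever materialising the joined rows, which is where the reported speedups come from.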


Automated Data Enrichment using Confidence-Aware Fine-Grained Debate among Open-Source LLMs for Mental Health and Online Safety

Mao, Junyu, Hills, Anthony, Tseriotou, Talia, Liakata, Maria, Shamir, Aya, Sayda, Dan, Atzil-Slonim, Dana, Djohari, Natalie, Mandal, Arpan, Roth, Silke, Ugwudike, Pamela, Niranjan, Mahesan, Middleton, Stuart E.

arXiv.org Artificial Intelligence

Real-world indicators, such as life events for mental health analysis and risky behaviour for online safety, are important for improving natural language processing (NLP) tasks, yet labelling such information in NLP training datasets is often costly or difficult given the dynamic nature of such events. This paper compares several LLM-based data enrichment methods and introduces a novel Confidence-Aware Fine-Grained Debate (CFD) framework in which multiple LLM agents simulate human annotators and exchange fine-grained evidence to reach consensus. We describe two new expert-annotated datasets: a mental health Reddit wellbeing dataset and an online safety Facebook sharenting risk dataset. Our CFD framework achieves the most robust data enrichment performance compared to a range of baselines, and we show that this type of data enrichment consistently improves downstream tasks. Enriched features incorporated via debate transcripts yield the largest gains, outperforming the non-enriched baseline by 10.1% for the online safety task.
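A stripped-down sketch of the consensus step may help fix the idea. The actual CFD framework has LLM agents exchange evidence over debate rounds; the toy below only shows confidence-weighted aggregation of annotator votes, with made-up labels and confidence values.

```python
from collections import defaultdict

def confidence_weighted_consensus(votes):
    """Toy consensus: each (label, confidence) vote from a simulated annotator
    contributes its confidence as weight; the label with the highest total
    weight wins. (Illustrative only; CFD's agents also exchange fine-grained
    evidence and revise their votes across debate rounds.)"""
    weights = defaultdict(float)
    for label, conf in votes:
        weights[label] += conf
    return max(weights, key=weights.get)

# Hypothetical labels: two low-confidence votes for "life_event" outweigh
# one moderately confident vote for "no_event"
label = confidence_weighted_consensus(
    [("life_event", 0.9), ("no_event", 0.6), ("life_event", 0.4)]
)
```

The point of weighting by confidence is that a single hesitant dissenter should not override two agents who are jointly more certain.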


Variance Matters: Improving Domain Adaptation via Stratified Sampling

Napoli, Andrea, White, Paul

arXiv.org Artificial Intelligence

Domain shift remains a key challenge in deploying machine learning models to the real world. Unsupervised domain adaptation (UDA) aims to address this by minimising domain discrepancy during training, but the discrepancy estimates suffer from high variance in stochastic settings, which can stifle the theoretical benefits of the method. This paper proposes Variance-Reduced Domain Adaptation via Stratified Sampling (VaRDASS), the first specialised stochastic variance reduction technique for UDA. We consider two specific discrepancy measures, correlation alignment and the maximum mean discrepancy (MMD), and derive ad hoc stratification objectives for these terms. We then present expected and worst-case error bounds, and prove that our proposed objective for the MMD is theoretically optimal (i.e., minimises the variance) under certain assumptions. Finally, a practical k-means-style optimisation algorithm is introduced and analysed. Experiments on three domain shift datasets demonstrate improved discrepancy estimation accuracy and target domain performance.
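The quantity whose minibatch variance VaRDASS targets can be written down directly. Below is a minimal sketch of the biased empirical squared MMD with an RBF kernel (a standard estimator, not the paper's stratified version): in stochastic training this is evaluated on small minibatches, and stratifying the sampling is what reduces the variance of that estimate.

```python
import numpy as np

def mmd2_rbf(X, Y, gamma=1.0):
    """Biased empirical squared MMD with RBF kernel k(a, b) = exp(-gamma ||a - b||^2).
    Small when X and Y come from the same distribution, larger under domain shift."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
src = rng.standard_normal((100, 2))
tgt_same = rng.standard_normal((100, 2))         # same distribution as src
tgt_shift = rng.standard_normal((100, 2)) + 2.0  # mean-shifted "target domain"

m_same = mmd2_rbf(src, tgt_same)
m_shift = mmd2_rbf(src, tgt_shift)
```

When this is computed on random minibatches rather than the full samples, the estimate fluctuates from batch to batch; the paper's contribution is choosing strata so that those fluctuations are provably minimised under its assumptions.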